(57) Abstract



A search engine is disclosed which suggests related terms to the user to allow the user to refine a search. The related terms are 
generated using query term correlation data which reflects the frequencies with which specific terms have previously appeared within the 
same query. The correlation data is generated and stored in a look-up table (137) using an off-line process (136) which parses a query 
log file (135). The table (137) is regenerated periodically from the most recent query submissions (e.g., the last two weeks of query
submissions), and thus strongly reflects the current preferences of users. Each related term is presented to the user via a respective hyperlink
(910) which can be selected by the user to submit a modified query. In one embodiment, the related terms are added to and selected from
the table (137) so as to guarantee that the modified queries will not produce a NULL query result.
SYSTEM AND METHOD FOR REFINING SEARCH QUERIES
BACKGROUND OF THE INVENTION

Field of Invention

The present invention relates to query processing, and more specifically relates to techniques for facilitating
the process of refining search queries.
Description of Related Art 

With the increasing popularity of the Internet and the World Wide Web, it is common for on-line users to
utilize search engines to search the Internet for desired information. Many web sites permit users to perform searches
to identify a small number of relevant items among a much larger domain of items. As an example, several web index
sites permit users to search for particular web sites among known web sites. Similarly, many on-line merchants, such
as booksellers, permit users to search for particular products among all of the products that can be purchased from the
merchant. Other on-line services, such as Lexis™ and Westlaw™, allow users to search for various articles and court
opinions.

In order to perform a search, a user submits a query containing one or more query terms. The query may also
explicitly or implicitly identify a record field or segment to be searched, such as title, author, or subject classification of
the item. For example, a user of an on-line bookstore may submit a query containing terms that the user believes
appear within the title of a book. A query server program of the search engine processes the query to identify any
items that match the terms of the query. The set of items identified by the query server program is referred to as a
"query result." In the on-line bookstore example, the query result is a set of books whose titles contain some or all of
the query terms. In the web index site example, the query result is a set of web sites or documents. In web-based
implementations, the query result is typically presented to the user as a hypertextual listing of the located items.

If the scope of the search is large, the query result may contain hundreds, thousands or even millions of
items. If the user is performing the search in order to find a single item or a small set of items, conventional
approaches to ordering the items within the query result often fail to place the sought item or items near the top of the
query result list. This requires the user to read through many other items in the query result before reaching the
sought item. Certain search engines, such as Excite™ and AltaVista™, suggest related query terms to the user as a
part of the "search refinement" process. This allows the user to further refine the query and narrow the query result
by selecting one or more related query terms that more accurately reflect the user's intended request. The related
query terms are typically generated by the search engine using the contents of the query result, such as by identifying
the most frequently used terms within the located documents. For example, if a user were to submit a query on the
term "FOOD," the user may receive several thousand items in the query result. The search engine might then trace
through the contents of some or all of these items and present the user with related query terms such as
"RESTAURANTS," "RECIPES," and "FDA" to allow the user to refine the query.

The related query terms are commonly presented to the user together with corresponding check boxes that
can be selectively marked or checked by the user to add terms to the query. In some implementations, the related
query terms are alternatively presented to and selected by the user through drop down menus that are provided on the
query result page. In either case, the user can add additional terms to the query and then re-submit the modified query.
Using this technique, the user can narrow the query result down to a more manageable set consisting primarily of
relevant items.

One problem with existing techniques for generating related query terms is that the related terms are
frequently of little or no value to the search refinement process. Another problem is that the addition of one or more
related terms to the query sometimes leads to a NULL query result. Another problem is that the process of parsing the
query result items to identify frequently used terms consumes significant processor resources, and can appreciably
increase the amount of time the user must wait before viewing the query result. These and other deficiencies in
existing techniques hinder the user's goal of quickly and efficiently locating the most relevant items, and can lead to
user frustration.

SUMMARY OF THE INVENTION 
The present invention addresses these and other problems by providing a search refinement system and
method for generating and displaying related query terms ("related terms"). In accordance with the invention, the
related terms are generated using query term correlation data that is based on historical query submissions to the
search engine. The query term correlation data ("correlation data") is preferably based at least upon the frequencies
with which specific terms have historically been submitted together within the same query. The incorporation of such
historical query information into the process tends to produce related terms that are frequently used by other users in
combination with the submitted query terms, and significantly increases the likelihood that these related terms will be
helpful to the search refinement process. To further increase the likelihood that the related terms will be helpful, the
correlation data is preferably generated only from those historical query submissions that produced a successful query
result (at least one match).

In accordance with one aspect of the invention, the correlation data is stored in a correlation data structure
(table, database, etc.) which is used to look up related terms in response to query submissions. The data structure is
preferably generated using an off-line process which parses a query log file, but could alternatively be generated and
updated in real-time as queries are received from users. In one embodiment, the data structure is regenerated
periodically (e.g., once per day) from the most recent query submissions (e.g., the last M days of entries in the query
log), and thus strongly reflects the current tastes of the community of users; as a result, the related terms suggested
by the search engine strongly reflect those tastes as well. Thus, for example, in the context of a
search engine of an online merchant, the search engine tends to suggest related terms that correspond to the current
best-selling products.

In a preferred embodiment, each entry in the data structure is in the form of a key term and a corresponding
related terms list. Each related terms list contains the terms which have historically appeared together (in the same
query) with the respective key term with the highest degree of frequency, ignoring unsuccessful query submissions
(query submissions which produced a NULL query result). The data structure thus provides an efficient mechanism for
looking up the related terms for a given query term.

To generate a set of related terms for refining a submitted query (the "present query"), the related terms list
for each term in the present query is initially obtained from the correlation data structure. If this step produces
multiple related terms lists (as in the case of a multiple-term query), the related terms lists are preferably combined by
taking the intersection between these lists (i.e., deleting the terms that are not common to all lists). The related terms
which remain are terms which have previously appeared, in at least one successful query submission, in combination
with every term of the present query. Thus, assuming items have not been deleted from the database being searched,
any of these related terms can be individually added to the present query while guaranteeing that the modified query
will not produce a NULL query result. To take advantage of this feature, the related terms are preferably presented to
the user via a user interface that requires the user to add no more than one related term per query submission. In other
embodiments, the related terms are selected and displayed without guaranteeing a successful query result.

Because the related terms are identified from previously-generated correlation data, without the need to 
parse documents or correlate terms, the related terms can be identified and presented to the user with little or no 
added delay. 

BRIEF DESCRIPTION OF THE DRAWINGS 
These and other features will now be described with reference to the drawings summarized below. These
drawings and the associated description are provided to illustrate a preferred embodiment of the invention, and not to
limit the scope of the invention.

Throughout the drawings, reference numbers are re-used to indicate correspondence between referenced
elements. In addition, the first digit of each reference number indicates the figure in which the element first appears.

Figure 1 illustrates a system in which users access web site information via the Internet and illustrates the 
basic web site components used to implement a search engine which operates in accordance with the invention. 

Figure 2 illustrates a sample book search page of the web site. 

Figure 3 illustrates sample log entries of a daily query log file. 

Figure 4 illustrates the process used to generate the correlation table of Figure 1. 

Figure 5A illustrates a sample mapping before a query is added. 

Figure 5B illustrates a sample mapping after a query is added. 

Figure 6 illustrates a process for generating the correlation table from the most recent daily query log files.

Figure 7 illustrates a process for selecting the related query terms from the correlation table.

Figure 8A illustrates a set of related query terms from a single-term query.

Figure 8B illustrates a set of intersecting terms and a set of related query terms from a multiple-term query.

Figure 9 illustrates a sample search result page of the web site.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention provides a search refinement system and method for generating related query terms
("related terms") using a history of queries submitted to a search engine by a community of users. Briefly, the system
generates query term correlation data which reflects the frequency with which specific terms have previously occurred
together within the same query. The system uses the query term correlation data in combination with the query
term(s) entered by the user to recommend additional query terms for refining the query. The incorporation of such
historical query information into the process tends to produce related terms that are frequently used by other users in
combination with the submitted query terms, and significantly increases the likelihood that these related terms will be
helpful to the search refinement process. To further increase the likelihood that the related terms will be helpful, the
correlation data is preferably generated only from those historical query submissions that produced a successful query
result (at least one match).

In the preferred embodiment, the query term correlation data is regenerated periodically from recent query
submissions, such as by using the last M days of entries in a query log, and thus heavily reflects the current tastes of
users. As a result, the related terms suggested by the search engine tend to be terms that correspond to the most
frequently searched items during the relevant time period. Thus, for example, in the context of a search engine of an
online merchant, the search engine tends to suggest related terms that correspond to the current best-selling products.
In one embodiment, the technique used to generate the related terms and present these terms to the user guarantees
that the modified query will not produce a NULL query result.

The search refinement methods of the invention may be implemented, for example, as part of a web site, an
Internet site, an on-line services network, a document retrieval system, or any other type of computer system that
provides searching capabilities to a community of users. In addition, the method may be combined with other methods
for suggesting related terms, such as methods which process the contents of located documents.

A preferred web-based implementation of the search refinement system will now be described with reference
to Figures 1-9. For purposes of illustration, the system is described herein in the context of a search engine that is
used to assist customers of Amazon.com Inc. in locating items (e.g., books, CDs, etc.) from an on-line catalog of
products. Throughout the description, reference will be made to various implementation-specific details of the
Amazon.com implementation. These details are provided in order to fully illustrate a preferred embodiment of the
invention, and not to limit the scope of the invention. The scope of the invention is set forth in the appended claims.

I. Overview of Web Site and Search Engine

Figure 1 illustrates the Amazon.com web site 130, including components used to implement a search engine
in accordance with the invention.

As is well known in the art of Internet commerce, the Amazon.com web site includes functionality for
allowing users to search, browse, and make purchases from an on-line catalog of book titles, music titles, and other
types of items via the Internet 120. Because the catalog contains millions of items, it is important that the site
provide an efficient mechanism for assisting users in locating items.

As illustrated by Figure 1, the web site 130 includes a web server application 131 ("web server") which
processes user requests received from user computers 110 via the Internet 120. These requests include queries
submitted by users to search the on-line catalog for products. The web server 131 records the user transactions,
including query submissions, within a query log 135. In the embodiment depicted in Figure 1, the query log 135
consists of a sequence of daily query log files 135(1)-135(M), each of which represents one day of transactions.

The web site 130 also includes a query server 132 which processes the queries by searching a bibliographic
database 133. The bibliographic database 133 includes information about the various products that users may
purchase through the web site 130. This information includes, for example, the titles, authors, publishers, subject
descriptions, and ISBNs (International Standard Book Numbers) of book titles, and the titles, artists, labels, and music
classifications of music titles. The information for each item is arranged within fields (such as an "author" field and a
"title" field), enabling the bibliographic database 133 to be searched on a field-restricted basis. The site also includes a
database 134 of HTML (Hypertext Markup Language) content which includes, among other things, product information
pages which show and describe the various products.

The query server 132 includes a related term selection process 139 which identifies related query terms based
on query term correlation data stored in a correlation table 137. As depicted in Figure 1 and described below, the
correlation table 137 is generated periodically from the M most recent daily query log files 135(1)-135(M) using an off-line
table generation process 136.

The web server 131, query server 132, table generation process 136, and database software run on one or
more Unix™-based servers and workstations (not shown) of the web site 130, although other types of platforms could
be used. The correlation table 137 is preferably cached in RAM (random access memory) on the same physical
machine as that used to implement the query server 132. To accommodate large numbers of users, the query server
132 and the correlation table 137 can be replicated across multiple machines. The web site components that are
invoked during the searching process are collectively referred to herein as a "search engine."

Figure 2 illustrates the general format of a book search page 200 of the Amazon.com web site 130 that can
be used to search the bibliographic database 133 for book titles. Users have access to other search pages that can be
used to locate music titles and other types of products sold by the on-line merchant. The book search page 200
includes author, title, and subject fields 210, 220, 240 and associated controls that allow the user to initiate
field-restricted searches for book titles. Users can perform searches by first typing the desired information into a search
field 210, 220, 240 and then clicking on the appropriate search button 230, 250. The term or string of terms
submitted to the search engine is referred to herein as the "query." Other areas of the web site ask the user to submit
queries without limiting the terms to specific fields.

When the user submits a query from the book search page 200 to the web site 130, the query server 132
applies the query to the bibliographic database 133, taking into account any field restrictions within the query. If the
query result is a single item, the item's product information page is presented to the user. If the query result includes
multiple items, the list of items is presented to the user through a query result page which contains hypertextual links
to the items' respective product information pages.

For multiple-term queries, the query server 132 effectively logically ANDs the query terms together to
perform the search. For example, if the user enters the terms "JAVA" and "PROGRAMMING" into the title field 220,
the query server 132 will search for and return a list of all items that have both of these terms within the title. Thus,
if any query term does not produce a match (referred to herein as a "non-matching term"), the query will produce a
NULL query result. Presenting a NULL query result to the user can cause significant user frustration. To reduce this
problem, in this event the user may be presented with a list of items that are deemed to be "close matches." Although
the search engine described herein logically ANDs the query terms together, it will be recognized that the invention can
be applied to search engines that use other methods for processing queries.
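
The AND behavior described above can be illustrated with a minimal sketch. The index contents, the item ids, and the function name `and_search` are hypothetical and chosen only for illustration; the description does not specify how the query server 132 is implemented internally.

```python
# Hypothetical in-memory title index: term -> set of item ids whose title
# contains that term. The ids and terms are illustrative only.
TITLE_INDEX = {
    "JAVA": {1, 2, 3},
    "PROGRAMMING": {2, 3, 4},
    "COOKING": {5},
}


def and_search(terms, index):
    """Return the ids of items matching every term.

    An empty set corresponds to a NULL query result.
    """
    result = None
    for term in terms:
        matches = index.get(term.upper(), set())
        result = set(matches) if result is None else result & matches
        if not result:
            # A non-matching term (or an empty overlap) makes the whole
            # query fail, because the terms are ANDed together.
            return set()
    return result or set()
```

With this index, `and_search(["JAVA", "PROGRAMMING"], TITLE_INDEX)` yields the items containing both terms, while a query containing any term absent from the index yields an empty (NULL) result.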

In accordance with the invention, the search engine uses the query term correlation data stored in the
correlation table 137 to select the related terms that best match the user's query. The search engine then presents
the related terms to the user, allowing the user to refine the search and enhance discovery of relevant information.
The query term correlation data indicates relationships between query terms, and is used to effectively predict query
terms that are likely to be helpful to the search refinement process. In accordance with another aspect of the
invention, the correlation table 137 preferably contains or reflects historical information about the frequencies with
which specific query terms have appeared together within the same query.

The general format of the correlation table 137 is illustrated in Figure 1. In the embodiment depicted in Figure 1
and described in detail herein, the correlations between query terms are based solely on frequency of occurrence within the
same query. As described below, other types of query term correlations can additionally be used. In addition, although the
disclosed implementation uses a table to store the query term correlation data, other types of databases can be used.

As illustrated by Figure 1, each entry within the correlation table 137 (two entries shown) has two primary
components: a "key" term 140, and a "related terms" list 142 for that key term. The related terms list 142 is a list of the
N (e.g., 50) query terms that have appeared within the same query as the key term with the highest degree of frequency,
and is ordered according to frequency. For example, the entry for the key term COSMOS (ignoring the single-term prefixes,
which are discussed below) is:

COSMOS: ASTRONOMY, SAGAN, UNIVERSE,
indicating that ASTRONOMY has appeared together with COSMOS with the highest degree of frequency; SAGAN has
appeared with COSMOS with the second highest degree of frequency, and so on. Each term that appears within the
related terms list 142 is deemed to be related to the corresponding key term 140 by virtue of the relatively high frequency
with which the terms have occurred within the same query.

As further depicted by Figure 1, each related term and each key term 140 preferably includes a single-character
field prefix which indicates the search field 210, 220, 240 to which the term corresponds. These prefixes may, for
example, be as follows: A = author, T = title, S = subject, R = artist, L = label, G = generic. In addition, each related
term is stored together with a correlation score 146 which, in the preferred embodiment, indicates the number of times the
related term has appeared in combination with the key term (within the search fields indicated by their respective field
prefixes), not counting queries that produced a NULL query result.

Thus, for example, the related term (including prefix) S-ASTRONOMY has a correlation score of 410 under the
key term T-COSMOS, indicating that four hundred and ten "successful" queries were received (during the time period to
which the table 137 corresponds) which included the combination of COSMOS in the title field and ASTRONOMY in the
subject field. Although the field prefixes and correlation scores 146 carry information which is useful to the related terms
selection process (as described below), such information need not be preserved.
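
One way to picture an entry of the correlation table 137 is as a mapping from a prefixed key term to a frequency-ordered list of prefixed related terms with scores. The following is a minimal sketch under stated assumptions: the type name `RelatedTerm`, the function `related_terms`, and the prefixes and scores shown for SAGAN and UNIVERSE are hypothetical. Only the score of 410 for S-ASTRONOMY under T-COSMOS comes from the description.

```python
from typing import NamedTuple


class RelatedTerm(NamedTuple):
    prefix: str  # single-character field prefix, e.g. "S" for subject
    term: str
    score: int   # successful queries pairing this term with the key term


# Field prefixes as listed in the description.
FIELD_PREFIXES = {
    "A": "author", "T": "title", "S": "subject",
    "R": "artist", "L": "label", "G": "generic",
}

# Key: (prefix, term). Value: related terms list, highest score first.
correlation_table = {
    ("T", "COSMOS"): [
        RelatedTerm("S", "ASTRONOMY", 410),  # score taken from the text
        RelatedTerm("A", "SAGAN", 307),      # hypothetical prefix and score
        RelatedTerm("T", "UNIVERSE", 201),   # hypothetical prefix and score
    ],
}


def related_terms(table, prefix, term, n=50):
    """Look up at most n related terms for a key term, ordered by score."""
    entries = table.get((prefix, term.upper()), [])
    return sorted(entries, key=lambda e: e.score, reverse=True)[:n]
```

Keeping the list sorted by score means the selection process can simply take entries from the top of the list.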

In operation, when a user submits a query to the web site 130, the web server 131 passes the query to the
query server 132, and the query server applies the query to the bibliographic database 133. If the number of items found
exceeds a certain threshold (e.g., 50), the query server 132 invokes its related term selection process ("selection process")
139 to attempt to identify one or more related terms to suggest to the user. The selection process may alternatively be
invoked without regard to whether a certain item count has been reached.

For each term in the query, the selection process 139 retrieves the respective related terms list 142 (if any) from
the correlation table 137, and if multiple lists result, merges these lists together. The selection process 139 then takes a
predetermined number (e.g., 5) of the related terms from the top of the resulting list and passes these "suggested terms"
to the web server 131 with the query result listing. Finally, the web server 131 generates and returns to the user a query
result page (Figure 9) which presents the suggested terms to the user for selection.

In one embodiment, the related terms lists are merged by retaining only the intersecting terms (terms which are
common to all lists), and discarding all other terms. An important benefit of this method is that any single related term of
the resulting list can be added to the query without producing a NULL query result. To take advantage of this feature,
these related terms are preferably presented to the user using an interface method (as in Figure 9) which requires the user
to add only one related term to the query per query submission.
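
The intersection-based merge described above can be sketched as follows. This is an illustrative sketch, not the disclosed implementation: the table contents are hypothetical, field prefixes are omitted, and ranking the surviving terms by the sum of their scores is an assumption, since this passage does not state how the merged list is ordered. Query terms with no table entry are skipped, following the "(if any)" wording; note that the no-NULL-result property only holds when every query term has an entry.

```python
# Hypothetical correlation data: key term -> {related term: score}.
TABLE = {
    "JAVA": {"PROGRAMMING": 120, "SCRIPT": 80, "BEANS": 60, "COFFEE": 10},
    "PROGRAMMING": {"JAVA": 120, "PERL": 90, "SCRIPT": 40, "BEANS": 15},
}


def suggest_terms(query_terms, table, max_suggestions=5):
    """Intersect the related terms lists of all query terms and return the
    top suggestions (ranked by summed score, which is an assumption)."""
    lists = [table[t] for t in query_terms if t in table]
    if not lists:
        return []
    common = set(lists[0])
    for scores in lists[1:]:
        common &= set(scores)      # keep only terms common to all lists
    common -= set(query_terms)     # never suggest a term already in the query
    ranked = sorted(common, key=lambda t: (-sum(s[t] for s in lists), t))
    return ranked[:max_suggestions]
```

For the query JAVA PROGRAMMING, only SCRIPT and BEANS appear in both lists, so only those two terms are suggested. Each has co-occurred with both query terms in some successful query.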

The operation of the related term selection process 139 is described in further detail below.

The disclosed search engine also preferably uses historical query submissions and item selections to rank query
results for presentation to the user. A preferred method for ranking query results based on such data is disclosed in U.S.
Patent Application No. 09/041,081 filed March 10, 1998. The search engine also preferably uses correlations between
query terms to correct misspelled terms within search queries. A preferred method for correcting spelling errors in search
queries is disclosed in U.S. Patent Application No. 09/115,662 entitled "System and Method for Correcting Spelling Errors
in Search Queries," filed July 15, 1998.

II. Capturing and Processing of Query Information

As indicated above, the query term correlation data is preferably generated from the query log 135 using the
table generation process ("generation process") 136. In the preferred embodiment, the table generation process 136 is
implemented as an off-line process which runs once a day and generates a new query correlation table 137. The
process effectively generates the table from the M most recent daily query log files 135(1)-135(M). Using a relatively
small M (e.g., 5) tends to produce query term correlation data that heavily reflects short term buying trends (e.g., new
releases, weekly best sellers, etc.), while using a larger M (e.g., 100) tends to produce a more comprehensive
database. A hybrid approach can alternatively be used in which the table is generated from a large number of log files,
but in which the most recent log files are given greater weight. For example, queries submitted during the last week
can be counted three times when generating the correlation scores 146, while queries submitted from one week to one
month ago can be counted only once. In addition, rather than using M consecutive days of query submissions, the
generation process 136 could use samples of query submissions from multiple different time periods.
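
The hybrid weighting example above (last week counted three times, one week to one month counted once) can be sketched as a weighted sum of per-day pair counts. The function names are hypothetical, a month is assumed to be 30 days, and dropping data older than one month is an assumption, since the example does not say how older queries are treated.

```python
from collections import Counter


def age_weight(age_days):
    """Weight for a daily count based on its age in days."""
    if age_days < 7:
        return 3   # last week: counted three times (from the text)
    if age_days <= 30:
        return 1   # one week to one month ago: counted once (from the text)
    return 0       # older data: dropped (assumption)


def weighted_pair_counts(daily_counts):
    """Combine (age_days, Counter of (key, related) pairs) into one Counter."""
    total = Counter()
    for age_days, counts in daily_counts:
        weight = age_weight(age_days)
        if weight == 0:
            continue
        for pair, n in counts.items():
            total[pair] += weight * n
    return total
```

For example, 4 co-occurrences from two days ago and 5 from twenty days ago contribute 3 × 4 + 1 × 5 = 17 to the correlation score of that pair.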

In the preferred embodiment, the building of the query correlation table 137 consists of two primary phases:
(1) generating daily log files, and (2) periodically parsing and processing these log files to generate the query correlation
table 137. Rather than generate new query term correlation data each time log information becomes available, the
generation process 136 preferably generates and maintains separate query term correlation data for different
constituent time periods of a relatively short length. In the preferred embodiment, the constituent time period is one
day such that query term correlation data for a single day is stored in a daily results file. Each time query term
correlation data is generated for a new constituent time period, the generation process 136 preferably combines this
new data with existing data from earlier constituent time periods to form a collective query correlation table with
information covering a longer composite period of time. This process is depicted in Figure 6 and is described further
below.

Any of a variety of alternative methods could be used to generate the correlation table 137. For example, the
generation process 136 could alternatively be implemented to update the query correlation table in real time by
augmenting the table each time a user submits a successful query. In addition, the table generation process 136
and/or the selection process 139 could take into consideration other types of correlations between query terms,
including extrinsic or "static" correlations that are not dependent upon the actions of users.
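
The real-time alternative mentioned above could look roughly like the following sketch, in which the table is augmented whenever a successful query arrives. The function name `record_query` and the nested-dictionary layout are hypothetical, and field prefixes are omitted for brevity.

```python
from collections import defaultdict
from itertools import permutations


def new_table():
    """Create an empty table: key term -> {related term: score}."""
    return defaultdict(lambda: defaultdict(int))


def record_query(table, terms, items_found):
    """Augment the table in place when a query is successful."""
    if items_found <= 0:
        return  # unsuccessful queries contribute nothing
    # Every ordered pair of distinct terms in the query is a
    # (key term, related term) co-occurrence. Single-term queries
    # produce no pairs.
    for key, related in permutations(set(terms), 2):
        table[key][related] += 1
```

Unlike the off-line process, this variant never discards old data on its own, so some form of aging or periodic reset would be needed to keep the table focused on recent submissions.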

A. Generating Daily Query Log Files

A web server generally maintains a log file detailing all of the requests it has received from web browsers.
The log file is generally organized chronologically and is made up of several entries, each containing information about
a different request.

In accordance with the invention, each time a user performs a search, the web server 131 stores information
about the submitted query in a log entry of a query log 135. In addition, the web server 131 generates daily query log
files 135(1)-135(M) which each contain the log entries for a respective day. Figure 3 illustrates four log entries of a
sample daily query log file 135. Each entry in the log file 135 includes information about a particular HTTP (Hypertext
Transfer Protocol) transaction. The first log entry 310 contains date and time information for when the user
submitted the query, the user identifier corresponding to the identity of the user (and, in some embodiments,
identification of the particular interaction with the web server), the name of the web page where the query was
entered, query terms entered by the user, and the number of the items found for the query. The "items_found" values
in the log preferably indicate the number of items that exactly matched the query.

For example, entry 310 indicates that at 2:23 AM on February 13, 1998, user 29384719287 submitted the
query {title = Snow Crash} from the book search page and that two items were found that exactly matched the query.
Entry 320 indicates that the same user selected an item having an ISBN of 0[...], and
that this selection was made from a search results page (as is evident from the HTTP_REFERRER line). Other types of
user actions, such as a request to place an item in a shopping cart or to purchase an item, are similarly reflected within the
query log 135. As indicated by the above example, a given user's navigation path can be determined by comparing entries
within the query log 135.

B. Generating the Correlation Table 

Figure 4 shows the preferred method for generating the correlation table 137. In step 410, the generation
process 136 goes through the most recent daily query log file to identify all multiple-term queries (i.e., queries
comprised of more than one term) that returned at least one item ("items_found" > 0) in the query result. In step
420, the generation process 136 correlates each query ("key") term found in the set of queries to related terms that
were used with the key term in a particular query, and assigns the related term a correlation score 146. The
correlation score indicates the frequency with which specific terms have historically appeared together within the
same query during the period reflected by the daily query log. In step 430, the generation process 136 stores the
terms coupled with their correlation scores in a daily results file. In step 440, the generation process 136 merges the
daily results files for the last M days. Finally, in step 450, the generation process 136 creates a new correlation table
137 and replaces the existing query correlation table.
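
Steps 420 through 450 can be sketched as follows, assuming step 410 has already produced the list of successful multiple-term queries for each day. The function names are hypothetical, field prefixes are omitted, daily results are held in memory rather than written to files, and ties between equal scores are broken alphabetically as an arbitrary choice.

```python
from collections import Counter
from itertools import permutations


def daily_results(successful_queries):
    """Steps 420-430: count (key term, related term) pairs for one day."""
    counts = Counter()
    for terms in successful_queries:
        # Each ordered pair of distinct terms in a query counts once.
        for key, related in permutations(sorted(set(terms)), 2):
            counts[(key, related)] += 1
    return counts


def build_table(daily_files, m, top_n=50):
    """Steps 440-450: merge the last m daily results (m >= 1) and build a
    new table keeping the top_n related terms per key term."""
    merged = Counter()
    for counts in daily_files[-m:]:
        merged.update(counts)  # Counter.update adds counts together
    table = {}
    for (key, related), score in merged.items():
        table.setdefault(key, []).append((related, score))
    for key in table:
        table[key].sort(key=lambda pair: (-pair[1], pair[0]))
        del table[key][top_n:]
    return table
```

Because each day's counts are kept separately, regenerating the table only requires processing the newest log file and re-merging the stored daily results.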

In the preferred embodiment, the generation process 136 is executed once per day at midnight, just after the
most recent daily query log is closed. In addition, it is assumed that the M-1 most recent daily query logs have already
been processed by steps 410 - 430 of the process to generate respective daily results files.

Each of the steps 410 - 450 of the Figure 4 process will now be described in greater detail.

Step 1: Processing the daily query log file
As indicated above, the generation process 136 parses the daily query log file in step 410 to identify and
extract successful multi-term queries. Ignoring the query submissions that produced a NULL query result (items_found
= 0) provides the important benefits of (1) preventing non-matching terms from being added to the correlation table -
either as keywords or as related terms - and (2) excluding potentially "weak" correlations between matching terms
from consideration. In addition, as described below, excluding such "unsuccessful" query submissions enables the
query terms selection process 139 to be implemented so as to guarantee that the modified query will produce a
successful query result (i.e., a query result in which the item count is greater than zero).

Using the Figure 3 log sequence as an example, the generation process 136 would parse the sample daily
query log file 135 beginning with log entry 310. The generation process 136 would extract the query for the first log
entry 310 because the query contains more than one query term and "items_found" is greater than zero. Next, the
generation process 136 would ignore entry 320 because it contains no query terms. The generation process 136
would then ignore entry 330 because although there are multiple query terms, the number of items found is not greater
than zero. The generation process 136 would next extract the log entry 340 and continue through the daily query log
file 135. In some embodiments, other information such as query field or subsequent actions performed by the user
may be used to determine which query submissions to extract or how heavily the queries should be weighted. In
addition, other methods may be used to extract the information from the query log.
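
The extraction rule of step 410 can be illustrated with a minimal Python sketch. The `LogEntry` class, its field names, and the sample values are hypothetical stand-ins for parsed log entries; they are not the log format of Figure 3, and the item counts other than the two items of entry 310 are invented.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    """Hypothetical parsed form of one query log entry."""
    fields: dict       # search field -> query string; empty for non-query actions
    items_found: int   # value of the items_found line


def extract_successful_multiterm(entries):
    """Keep entries whose query has more than one term and items_found > 0."""
    kept = []
    for entry in entries:
        term_count = sum(len(q.split()) for q in entry.fields.values())
        if term_count > 1 and entry.items_found > 0:
            kept.append(entry)
    return kept


# Illustrative entries only; they mirror the cases discussed, not Figure 3 itself.
sample_log = [
    LogEntry({"title": "Snow Crash"}, 2),       # multi-term, successful: kept
    LogEntry({}, 0),                            # item selection, no query terms: ignored
    LogEntry({"subject": "outdoor trail"}, 0),  # NULL query result: ignored
    LogEntry({"subject": "outdoor trail"}, 7),  # multi-term, successful: kept
    LogEntry({"title": "dune"}, 12),            # single term: ignored
]
extracted = extract_successful_multiterm(sample_log)
```

Under these assumptions, only the first and fourth sample entries survive the filter, so no term from a NULL query result can reach the correlation table.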

Step 2: Correlate terms

In accordance with the invention, the generation process 136 first takes each extracted query, and for each
query term, adds a single-character field prefix ("prefix") which indicates the search field in which the query term was
entered. Thus, for example, using the prefixes listed above, the prefix "T" would be added to the terms "SNOW" and
"CRASH" in log entry 310, and the prefix "S" would be added to the terms "OUTDOOR" and "TRAIL" in log entry
340. During this process, identical terms that were submitted in different search fields are assigned different prefixes
and are treated as different terms. For example, the term "SNOW" with a prefix of "T" would be treated as different
from "SNOW" with the prefix "S." In the implementation described herein, the key term and related terms are stored
without regard to alphabetic case, although case information can alternatively be preserved.
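
A minimal Python sketch of the prefixing step follows. The `FIELD_PREFIX` mapping and the `prefix_terms` helper are hypothetical names; the field-to-letter assignments are assumed for illustration, and terms are upper-cased to reflect storage without regard to case.

```python
# Assumed field-to-prefix assignments, for illustration only.
FIELD_PREFIX = {"title": "T", "subject": "S", "author": "A"}


def prefix_terms(field, query):
    """Return the query's terms, case-folded and tagged with the field prefix."""
    prefix = FIELD_PREFIX[field]
    return [f"{prefix}-{term.upper()}" for term in query.split()]


title_terms = prefix_terms("title", "Snow Crash")
subject_terms = prefix_terms("subject", "snow")
```

Because the prefix is part of the stored string, "T-SNOW" and "S-SNOW" are distinct keys and accumulate separate correlation scores.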

The generation process 136 then maps each query ("key") term found in the query and its prefix to other
terms ("related terms") used with that particular query. A correlation score is maintained for each related term in the
mapping based on the number of times the related term occurred in combination with the key term. The final values of
the correlation scores taken over M days are stored within the query correlation table 137 as the correlation scores
146 depicted in Figure 1.

For example, if a user submits the query "ROUGH GUIDE TO LONDON" in the title field 220, the terms would
first be coupled with the prefix "T." The correlation scores in the mapping for "T-GUIDE," "T-TO," and "T-LONDON,"
relative to the key "T-ROUGH," would be incremented. Similarly, the correlation scores for the related terms under the
keys "T-GUIDE," "T-TO," and "T-LONDON" would also be incremented.

Figure 5A illustrates an example mapping. In this figure, it is assumed that the generation process 136 has
already processed many thousands of log entries. For each key term 140 stored in the table 137A, there is a related
terms list 142 such that each related term in the list is coupled with a prefix and a value 146 representing the
correlation score. Each time the key term 140 and a related term 142 are used together in a query, the related term's
value 146 is incremented.

Assume that the table generation process 136 parses a query "OUTDOOR BIKE TRAIL" submitted in the
subject field. Figure 5A shows the mapping before the query is added. In response to the query, the generation
process 136 updates the mapping 137A, producing the mapping 137B shown in Figure 5B. The generation process
136 first looks up the key term "S-OUTDOOR" 560 and then looks for the related terms "S-BIKE" 580 and "S-TRAIL"
590. If the related term is found, its value is incremented. If the related term is not found, the generation process 136
adds the related term and assigns it a beginning value. In the example shown in Figure 5B, the values for both
"S-BIKE" 580 and "S-TRAIL" 590 have been incremented by one. Note that under the key term "S-OUTDOOR," the value
for the term "S-TRAIL" was incremented while the value for the term "T-TRAIL" was not incremented. This is because
the query was submitted in the subject field, thus affecting only terms with the prefix "S."
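
The update described above can be sketched in Python as a nested mapping from key term to related term to score. This is an illustration only: the preferred implementation stores the mapping in a B-tree, whereas the sketch uses in-memory dictionaries, and the `add_query` helper and the choice of a beginning value equal to the increment are assumptions.

```python
from collections import defaultdict
from itertools import permutations


def add_query(mapping, prefixed_terms, increment=1):
    """Increment the score of every (key term, related term) pair in one query."""
    # Each ordered pair of distinct terms updates the entry key -> related.
    for key, related in permutations(set(prefixed_terms), 2):
        mapping[key][related] += increment  # a missing related term starts at 0


mapping = defaultdict(lambda: defaultdict(int))
add_query(mapping, ["S-OUTDOOR", "S-BIKE", "S-TRAIL"])
```

After this call, "S-BIKE" and "S-TRAIL" each have a score of one under the key "S-OUTDOOR", and no term carrying the "T" prefix has been touched.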

In some embodiments, certain key terms may be excluded from the mapping if they are frequently used, and
yet do not further the search refinement process. For example, common articles such as "THE," "A," "TO," and "OF"
may be excluded from the mapping. While only three partial entries are depicted in Figure 5A, many thousands of
entries would be stored in a typical daily results file. In the preferred implementation, the mapping for a daily query log
file is stored in a B-tree data structure. In other embodiments, a linked list, database, or other type of data structure
can be used in place of the B-tree.

In addition, the amount by which the correlation scores are incremented may be increased or decreased
depending on different kinds of selection actions performed by the users on items identified in query results. These
may include whether the user displayed additional information about an item, how much time the user spent viewing
the additional information about the item, how many hyperlinks the user followed within the additional information
about the item, whether the user added the item to his or her shopping basket, and whether the user ultimately
purchased the item. For example, a given query submission can be counted twice (such as by incrementing the
correlation score by two) if the user subsequently selected an item from the query result page, and counted a third
time if the user then purchased the item or added the item to the shopping basket. These and other types of post-
search activities reflect the usefulness of the query result and can be extracted from the query log 135 using well-
known tracing methods.
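
The counting example in this paragraph can be expressed as a small weighting function. The sketch below is one possible reading, with a hypothetical function name and boolean inputs that are assumed to have been derived from the query log beforehand; other post-search signals mentioned above (viewing time, hyperlinks followed) are omitted.

```python
def query_weight(selected_item, purchased_or_basketed):
    """Return the amount by which correlation scores are incremented for one query."""
    weight = 1                      # every successful query counts once
    if selected_item:
        weight += 1                 # counted twice if a result item was selected
        if purchased_or_basketed:
            weight += 1             # counted a third time after purchase or basket add
    return weight


weights = [query_weight(False, False), query_weight(True, False), query_weight(True, True)]
```

The resulting weight would be passed as the increment when the query's term pairs are added to the mapping.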

Step 3: Create Daily Results File

Once the mapping is complete, that is, all entries in the daily query log file have been parsed, the generation
process 136 creates a daily results file (step 430) to store the B-tree. In other embodiments, the daily results file may
be generated at an earlier stage of the process, and may be incrementally updated as the parsing occurs.

Step 4: Merge Daily Results Files

In step 440, the generation process 136 generates the query correlation table 137 for a composite period by
combining the entries of the daily results files for the length of the composite period. As depicted in Figure 6, the table
generation process 136 regenerates the query correlation table 137 on a daily basis from the M most recent daily
results files, where M is a fixed number such as 10 or 20. Each day, the daily results file created in step 430 is merged
with the last M-1 daily results files to produce the query correlation table 137.

For example, in Figure 6, suppose the generation process 136 generates a daily results file for 7-Feb-98 610
and is set to generate a new query correlation table for the period of the last seven days (M = 7). At the end of
7-Feb-98, the generation process 136 would merge the daily results files from the past seven days for the composite period
of 1-Feb-98 to 7-Feb-98 to form a new query correlation table 137A. At the end of 8-Feb-98, the generation process
136 would generate a daily results file for 8-Feb-98 630 and then merge the daily results files from the past seven
days for the composite period of 2-Feb-98 to 8-Feb-98 to form a new query correlation table 137B. When the entries
are merged, the scores of the corresponding entries are combined, for example, by summing them. In one embodiment,
the scores in more recent daily results files are weighted more heavily than those scores in less recent daily results
files, so that the query term correlation data more heavily reflects recent query submissions over older query
submissions. This "sliding window" approach advantageously produces a query correlation table that is based only
on recent query submissions, and which thus reflects the current preferences of users. For example, if a relatively
large number of users have searched for the book Into Thin Air by Jon Krakauer over the past week, the correlations
between the terms "T-INTO," "T-THIN," "T-AIR," and "A-KRAKAUER" will likely be correspondingly high; a query
which consists of a subset of these terms will thus tend to produce a related terms list which includes the other
terms.
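
The sliding-window merge can be sketched as follows, assuming each daily results file has already been loaded as a nested dictionary of the form {key term: {related term: score}}. The function name, the in-memory representation, and the optional per-day weights are illustrative assumptions; the scores in the usage lines are invented.

```python
from collections import Counter


def merge_daily_results(daily_files, m, weights=None):
    """Merge the M most recent daily results (list ordered oldest to newest)."""
    window = daily_files[-m:]
    if weights is None:
        weights = [1] * len(window)       # plain summation
    merged = {}
    for weight, day in zip(weights, window):
        for key, related in day.items():
            bucket = merged.setdefault(key, Counter())
            for term, score in related.items():
                bucket[term] += weight * score
    return merged


days = [
    {"T-INTO": {"T-THIN": 2}},                # oldest day, outside a 2-day window
    {"T-INTO": {"T-THIN": 3, "T-AIR": 1}},
    {"T-INTO": {"T-AIR": 4}},                 # most recent day
]
summed = merge_daily_results(days, m=2)
recency_weighted = merge_daily_results(days, m=2, weights=[1, 2])
```

With plain summation the oldest file simply drops out of the window; passing larger weights for later days is one way to realize the embodiment in which recent submissions count more heavily.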

Step 5: Replace Old Query Correlation Table With New Query Correlation Table
In step 450, once the daily results files have been merged, the generation process 136 sorts the related
terms lists from highest-to-lowest score. The generation process 136 then truncates the related terms lists to a fixed
length N (e.g., 50) and stores the query correlation table in a B-tree for efficient lookup. The new query correlation
table 137 B-tree is then cached in RAM (random access memory) in place of the existing query correlation table.
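
A minimal sketch of the sort-and-truncate step is given below. It assumes the merged data is a nested dictionary and returns a plain dictionary of sorted lists as a stand-in for the B-tree; the function name and the sample scores are hypothetical.

```python
def finalize_table(merged, n=50):
    """Sort each related terms list by descending score and keep the top n."""
    table = {}
    for key, related in merged.items():
        ranked = sorted(related.items(), key=lambda pair: pair[1], reverse=True)
        table[key] = ranked[:n]
    return table


# Invented scores, for illustration only.
table = finalize_table({"S-TRAIL": {"S-BIKE": 5, "S-MAP": 1, "S-HIKE": 9}}, n=2)
```

Truncating to a fixed length bounds the size of the table that must be held in memory, at the cost of discarding weakly correlated terms.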

III. Using the Table to Generate Related Terms

As indicated above, the query server 132 uses the query correlation table 137 to select related terms to be
suggested to the user. More specifically, when a user performs a search which identifies more than a predetermined
number of items, the related term selection process ("selection process") 139 returns a query result listing items that
match the query along with a set of related terms generated from the query correlation table. An important benefit of
this method is that it is highly efficient, allowing the query result page to be returned without adding appreciable delay.
Further, the small delay added by the related terms selection process can be completely avoided by optionally
generating the related terms concurrently with the search of the bibliographic database 133 (rather than waiting to
see if a threshold item count is reached).

Figure 7 illustrates the sequence of steps performed by the selection process 139. The selection process
139 first enters a loop (steps 710-740) in which the selection process 139 looks up a query term in the correlation
table and then retrieves the term's related terms list 142. This continues for each term in the query. Next, if the
query has multiple terms, in step 760, the selection process 139 combines the related terms lists. The lists are
preferably combined by taking the intersection of the related terms lists (i.e., deleting terms which do not appear in all
lists) and summing the correlation scores of the remaining terms. At this point, every term which remains in the list is
a term which has appeared, in at least one prior, successful query, in combination with every term of the present
query. Thus, assuming entries have not been deleted from the bibliographic database 133 since the beginning of the
composite time period (the period to which the table 137 applies), any of these terms can be added individually to the
present query without producing a NULL query result. In other embodiments, the selection process 139 combines the
related terms lists by summing the correlation scores of terms common to other related terms lists, without deleting
any terms. Another implementation might give weighted scores for intersecting terms such that terms appearing in
more than one related terms list are weighted heavier than those terms appearing only in a single related terms list.
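
The preferred combination rule (intersection with summed scores) can be sketched in Python as shown below. Each related terms list is assumed to be available as a dictionary of term to score; the function name and all scores in the example are invented for illustration.

```python
def combine_related(lists):
    """Intersect the related terms lists and sum the scores of surviving terms."""
    if not lists:
        return {}
    common = set(lists[0])
    for related in lists[1:]:
        common &= set(related)        # drop terms missing from any list
    return {term: sum(related[term] for related in lists) for term in common}


# One related terms list per query term; scores are invented.
outdoor = {"S-BIKE": 20, "S-SPORTS": 15, "S-VACATION": 5, "S-GEAR": 9}
trail = {"S-BIKE": 18, "S-SPORTS": 7, "S-VACATION": 4, "S-MAP": 11}
combined = combine_related([outdoor, trail])
```

Terms such as "S-GEAR" and "S-MAP", which co-occurred with only one of the two query terms, are removed; every surviving term has appeared with each query term in some successful query.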

In step 770, the selection process 139 selects the X terms with the highest values from the list, where X can
be any desired number. In one embodiment, the selection process 139 chooses the top X related terms without regard
to the field prefixes of these related terms. The selection process may alternatively be configured to select only the
related terms that correspond to the search field(s) of the present query; for example, if the query was entered in the
subject field 240 (Figure 2), the user may be presented only with other subject terms (related terms with the prefix
"S").

For single-term queries, the selection process 139 thus retrieves the top X terms from the table. Figure 8A
illustrates the related terms that would be generated for a single-term query of "TRAIL" in the subject field using the
mapping from Figure 5B. The selection process 139 would look up the key term "S-TRAIL" 570 and select X related
terms with the highest X values. For example, suppose the selection process 139 were configured to suggest three
related terms (X = 3) that correspond to the search field(s) of the present query. The selection process 139 would then
look up the key term "S-TRAIL" 570 and display the three related terms with the top three values 810 and with the
same prefix as the key term, as illustrated in Figure 8A.

For multiple-term queries, the selection process 139 obtains the related terms lists 142 for each of the query
terms, and then takes the intersection of these lists. Figure 8B illustrates the related term results for a multiple-term
query in the subject field of "OUTDOOR TRAIL" using the mapping from Figure 5B. The selection process 139 would
look up the key terms "S-OUTDOOR" 560 and "S-TRAIL" 570 and see if they have any related terms in common. In
the mapping, the related terms "S-BIKE," "S-SPORTS," and "S-VACATION" are found under the key terms
"S-OUTDOOR" 560 and "S-TRAIL" 570; thus "S-BIKE," "S-SPORTS," and "S-VACATION" are the intersecting terms 820
as illustrated in Figure 8B. The selection process 139 would then display the X intersecting terms with the same
prefix and the X highest summed correlation scores. If there were fewer than X intersecting related terms, the selection
process 139 could show the intersecting terms with any prefix or use other criteria to generate the remaining related
terms. For example, the process 139 could take the top Y terms with the highest summed correlation scores from the
non-intersecting related terms, although suggesting such terms could produce a NULL query result.
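
The final selection of step 770 can be sketched as a top-X filter over the combined scores. The function name, the optional prefix argument, and the scores are assumptions made for illustration; ties are broken alphabetically here, which the description does not specify.

```python
def select_top(combined, x, prefix=None):
    """Return the x highest-scoring terms, optionally limited to one field prefix."""
    candidates = [
        (term, score) for term, score in combined.items()
        if prefix is None or term.startswith(prefix + "-")
    ]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))  # score descending, then name
    return [term for term, _ in candidates[:x]]


# Invented summed scores for the intersecting terms of a hypothetical query.
combined_scores = {"S-BIKE": 40, "S-SPORTS": 25, "S-VACATION": 10, "T-GUIDE": 99}
same_field = select_top(combined_scores, 2, prefix="S")
any_field = select_top(combined_scores, 1)
```

Restricting by prefix yields only subject terms, while omitting the prefix selects the highest-scoring terms regardless of search field.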

As indicated above, the method can alternatively be implemented without preserving or taking into account
search field information. In addition, the method can be appropriately combined with other techniques for generating
related terms, including techniques which use the contents of the query result.
IV. Presenting the Related Query terms to the User 

There are a number of different ways to present the related terms to the user, including the conventional
methods (check boxes and drop-down menus) described above. In implementations which suggest only the intersecting
related terms, an interface which requires the user to add no more than one related term per query submission is
preferably used, so that the modified query will not produce a NULL query result.

In the preferred embodiment, the related terms are presented through hypertextual links which combine both
the original query term(s) and a respective related term. For example, if the user enters the query "ROUGH" in the
subject field, three additional hyperlinks may be displayed on the query result page, each of which generates a modified
search when clicked on by the user. Each of these links is formed by combining the user's query with a related term
(e.g., the three hyperlinks might be "ROUGH - GUIDE," "ROUGH - LONDON," and "ROUGH - TERRAIN"). When the
user clicks on one of these links, the corresponding modified query is submitted to the search engine. The method thus
enables the user to select and submit the modified query with a single action (e.g., one click of a mouse). As an
inherent benefit of the above-described method of generating the related terms, each such link produces at least one
hit.
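
The construction of such links can be sketched with the Python standard library. The `/search` path and the use of the field name as the URL parameter are hypothetical; the description does not specify the URL format of the web site.

```python
from urllib.parse import urlencode


def refinement_links(field, query, related_terms, base="/search"):
    """Build one (label, href) pair per related term; each adds a single term."""
    links = []
    for term in related_terms:
        modified_query = f"{query} {term}"
        href = f"{base}?{urlencode({field: modified_query})}"  # hypothetical URL scheme
        links.append((f"{query} - {term}", href))
    return links


links = refinement_links("subject", "ROUGH", ["GUIDE", "LONDON", "TERRAIN"])
```

Because each link appends exactly one related term to the original query, the interface enforces the one-term-per-submission constraint without any extra controls.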

Figure 9 illustrates a sample query result page 900 in which a user has performed a subject field search on
the terms "OUTDOOR TRAIL" and has received a set of three related terms, each of which is incorporated into a
respective hyperlink 910. The page will also typically contain a listing of the query result items 920. If the user clicks
on the hyperlink "OUTDOOR TRAIL - BIKE," the search engine will perform a search using the terms "S-OUTDOOR,"
"S-TRAIL," and "S-BIKE," and will then return the associated items. The query result page 900 may also have search
fields (not shown) for allowing the user to edit the query.

Any of a variety of additional techniques can be used in combination with this hyperlink-based interface. For
example, in one embodiment, the query server 132 automatically selects the related term at the top of the related terms
list (such as the term "bike" in the Figure 9 example), and searches the query result to identify a subset of query result
items that include this related term. The query server 132 thereby effectively applies the "top" suggested modified
query to the bibliographic database 133. This process could be repeated using additional related terms in the list. The
items within the subset can then be displayed to the user at the top of the query result list, and/or can be displayed in
highlighted form. Further, the query server 132 could cache the list of items that fall within the subset so that if the
user submits the modified query (such as by clicking on the link "OUTDOOR BIKE - TRAIL" in Figure 9), the query
server could return the result of the modified search without having to search the bibliographic database. Special tags
or codes could be embedded within the modified-query hyperlinks and passed to the web site 130 to enable the query
server 132 to match the modified queries to the cached results.

Although this invention has been described in terms of certain preferred embodiments, other embodiments
that are apparent to those of ordinary skill in the art are also within the scope of this invention. Accordingly, the
scope of the present invention is defined only by reference to the appended claims.

In the claims which follow, reference characters used to denote process steps are provided for convenience
of description only, and not to imply a particular order for performing the steps.

WHAT IS CLAIMED IS:

1. In a computer system that implements a search engine which is accessible to a community of
users, a method of assisting users in refining search queries to enhance discovery, the method comprising the
computer-implemented steps of:

(a) processing search queries submitted to the search engine by a plurality of users over a
period of time to generate query term correlation data, the query term correlation data reflecting frequencies
with which query terms appear together within the same search query;

(b) receiving a search query from a user, the search query including at least one query term;

(c) using at least the query term correlation data to identify a plurality of additional query
terms that are deemed to be related to the at least one query term; and

(d) presenting the plurality of additional query terms to the user for selection to allow the
user to refine the search query.

2. The method of Claim 1, wherein step (a) comprises generating a data structure which links key
terms to related terms based on correlations between occurrences of terms within historical query submissions, and
step (c) comprises accessing the data structure to look up related terms.

3. The method of Claim 1, wherein the search query includes multiple query terms, and step (c)
comprises the sub-steps of:

(c1) for each of the multiple query terms, identifying a set of terms that have previously
occurred in combination with the respective query term within a successful query; and

(c2) selecting, as the additional terms, a set of terms that are common to all of the sets
identified in step (c1).

4. The method of Claim 3, wherein step (d) comprises presenting the additional terms via a user
interface which inhibits the user from selecting more than one additional term, the method thereby guaranteeing that a
modified query produced by adding an additional term will not produce a NULL query result.

5. The method of Claim 4, wherein step (d) comprises presenting the user with a plurality of
hyperlinks which can be selected to submit a modified query, each hyperlink adding a different respective additional
term to the query.

6. The method of Claim 1, wherein step (a) comprises processing a log that includes search queries
submitted to the search engine.

7. The method of Claim 6, wherein the step of processing the log comprises ignoring search queries that
produced a NULL query result.

8. The method of Claim 6, wherein the step of processing the log comprises applying a time-based
biasing function to the log to favor recent search query submissions over aged search query submissions, so that the
query term correlation data and the additional terms reflect current preferences of the community of users.

9. The method of Claim 1, wherein step (a) comprises updating the query term correlation data
substantially in real time as the search queries are received by the search engine.

10. The method of Claim 1, wherein step (d) comprises presenting the user with a plurality of hyperlinks,
each hyperlink being selectable to submit a refined search query which includes a respective additional query term, the
method thereby enabling the user to initiate a refined search with a single action.

11. The method of Claim 1, wherein step (a) further comprises evaluating post-query-submission
actions of users to identify search queries that are deemed to have produced useful results, and weighting the search
queries that produced useful results more heavily in generating the correlation data.

12. The method of Claim 1, wherein step (c) is performed in parallel with a step of applying the query to a
database to be searched.

13. The method of Claim 1, further comprising using at least one of the additional terms to select query
result items to display at the top of a query result listing.

14. In a computer system that implements a search engine in which related terms are suggested to users
to facilitate interactive refinement of search queries, a system for generating related terms, comprising:

a first process which generates a data structure that links key terms to related terms based at
least upon correlations between occurrences of terms within historical query submissions; and

a second process which uses the data structure in combination with a search query submitted by a
user to select related terms to suggest to the user.

15. The system of Claim 14, wherein the first process determines the correlations between 
occurrences of terms by at least parsing a log that includes historical query submissions. 

16. The system of Claim 14, wherein the first process generates and updates the data structure 
substantially in real-time as search queries are received by the search engine. 

17. The system of Claim 14, wherein the first process regenerates the data structure periodically from
a log of recent query submissions, so that the related terms suggested to the user reflect current preferences of users.

18. The system of Claim 14, wherein the first process determines the correlations by at least counting 
a number of times the terms have occurred within the same query. 

19. The system of Claim 14, wherein the first process ignores query submissions that produced NULL 
query results, so that the data structure reflects only successful query submissions. 

20. The system of Claim 19, wherein the second process processes a multiple-term search query by at
least:

(a) for each term in the search query, using the data structure to identify a respective set of
terms that were previously submitted to the search engine in combination with the term in a successful
search query; and

(b) selecting a set of related terms such that each related term is common to each set
identified in step (a).

21. The system of Claim 20, further comprising a user interface process which presents the set of
related terms to the user for selection such that no more than one related term can be added to the search query per
query submission, the second process thereby ensuring that a modified query produced by adding a related term will
not produce a NULL query result.

22. In a computer system that implements a search engine that is accessible to a community of users,
a method of assisting users in refining search queries to enhance discovery, the method comprising:

(a) receiving a search query from a user, the search query including at least one query term;

(b) using at least historical search query data to identify a plurality of additional query terms
that are deemed to be related to the at least one query term, the historical search query data based on
previously submitted search queries; and

(c) presenting the plurality of additional query terms to the user for selection to allow the
user to refine the search query.

23. The method of Claim 22, wherein the search query includes multiple query terms, and step (b)
comprises the sub-steps of:

(b1) for each of the multiple query terms, identifying a set of terms that have previously
occurred in combination with the respective query term within a successful query; and

(b2) selecting, as the additional query terms, a set of terms that are common to all of the sets
identified in step (b1).

24. The method of Claim 23, wherein step (d) comprises using a user interface method which inhibits
the user from selecting more than one additional term, the method thereby guaranteeing that a modified query
produced by adding an additional term will not produce a NULL query result.

25. In a search engine that suggests related terms to users to facilitate search refinement, a method of
generating related terms so as to increase a likelihood that a modified query will not produce a NULL query result, the
method comprising:

(a) receiving a search query from a user, the query including at least one term;

(b) for each term in the search query, using historical query information to identify a
respective set of terms that were previously submitted to the search engine, in combination with the term, in
a successful search query;

(c) selecting a set of related terms such that each related term is common to each set
identified in step (b); and

(d) presenting the set of related terms to the user for addition to the search query.

26. The method of Claim 25, wherein step (d) comprises presenting the related terms via a user 
interface which inhibits the user from selecting more than one additional term to add to the query. 

27. The method of Claim 26, wherein the step (d) comprises presenting the user with a plurality of
hyperlinks, each hyperlink being selectable to submit a refined search query which includes a respective related term, the
method thereby enabling the user to initiate a refined search with a single action.

28. The method of Claim 25, wherein the search query comprises multiple query terms. 